
tsm: ensure that snapshot writes are eventually retried/released, even if WriteSnapshot fails #5689

Closed

Conversation

jonseymour
Contributor

This is my attempt to fix the code review issue I identified in #5686.

The controversial aspect of this fix may be the introduction of a commit lock that ensures that
at most one thread is writing a snapshot for a given shard at any given time. I think the lock
is justified because reasoning about concurrent execution of these code paths seems fraught with
danger. Of course, if there are contrary opinions, let's discuss.

@jonseymour jonseymour changed the title tsm: ensure that snapshots are always released, even if WriteSnapshot fails tsm: ensure that snapshots are eventually released, even if WriteSnapshot fails Feb 15, 2016
@jonseymour
Contributor Author

Oops. I need to re-roll with corresponding changes to the tests.

@jonseymour
Contributor Author

On reflection, I think this fix is probably incomplete: the write for the latest snapshot will not gather the keys from snapshots that failed to be written by a previous WriteSnapshot() call, so we end up losing the data from those snapshots. I need to think about this some more. I will re-open the PR when/if I find a solution.

@jonseymour jonseymour closed this Feb 15, 2016
@jonseymour jonseymour reopened this Feb 15, 2016
@jonseymour jonseymour force-pushed the js-failed-write-snapshot branch 2 times, most recently from df3edd7 to 8257570 on February 15, 2016 at 22:46
@jonseymour
Contributor Author

Previously, if a snapshot write failed, the memory was never released and the associated writes were never retried.

Furthermore, if the write error was transient and a subsequent compact was successful, then associated WAL segments would be deleted without ever having been written into TSM files. Such a data loss would not be apparent until influxd was restarted (because the data would still be served from the cache).

Interestingly, these issues made influxd less resilient to transient write errors (which should not require a server restart) than to permanent write errors (which do). The reason is that, provided the WAL segments are not deleted, the normal influxd restart process will eventually compact the WAL segments associated with failed snapshot writes. However, if the write errors are only transient, then the next successful compaction will delete the WAL segments without their contents ever having been written into TSM files.

This series should address these issues.

@jonseymour jonseymour changed the title tsm: ensure that snapshots are eventually released, even if WriteSnapshot fails tsm: ensure that snapshot writes are eventually retried/released, even if WriteSnapshot fails Feb 15, 2016
@jonseymour
Contributor Author

I have now added some tests for the rollback functionality. The new and revised tests show how the association between a cache snapshot and a slice of WAL segment names is established and how WriteSnapshots() can use this capability to retry previously failed snapshot writes.

@jonseymour jonseymour force-pushed the js-failed-write-snapshot branch 7 times, most recently from 953958f to b65864d on February 16, 2016 at 07:22
@jonseymour
Contributor Author

Fixed a thread-safety issue in the implementation: PrepareSnapshots() now returns a clone of the snapshots slice rather than a reference. This allows the snapshotter to modify the contents of the slice without interfering with queries that may be trying to use it.
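
A rough sketch of the cloning approach described above (names such as cache, snapshots and Snapshot are assumptions made for the sketch, not the actual influxdb code):

    package sketch

    import "sync"

    // Snapshot is an illustrative stand-in for the real snapshot type.
    type Snapshot struct{}

    type cache struct {
        mu        sync.RWMutex
        snapshots []*Snapshot
    }

    // PrepareSnapshots returns a copy of the snapshots slice so the snapshotter
    // can reorder or truncate its copy without racing against readers that were
    // handed the original slice.
    func (c *cache) PrepareSnapshots() []*Snapshot {
        c.mu.Lock()
        defer c.mu.Unlock()

        clone := make([]*Snapshot, len(c.snapshots))
        copy(clone, c.snapshots)
        return clone
    }

The clone shares the underlying snapshot pointers, but its backing array is private to the caller, so reordering or truncating it cannot affect readers holding the original slice.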

@jonseymour
Contributor Author

Note: commits prior to 7059331 are from PR #5778 and will not be delivered by the final version of this PR.

@jonseymour
Contributor Author

This query output shows how this fix recovered the snapshots that could not be written because of temporary removal of write permission to the shard's TSM directory. When the write permissions were restored, the snapshot count dropped to zero again.

[screenshot: demonstration of fix]

The second screenshot is taken after restarting the server, indicating that the full contents of the cache were successfully persisted to TSM files.

[screenshot: after start]

Log output showing compaction of resulting TSM files:

[tsm1] 2016/02/22 18:00:27 error writing snapshot from compactor: remove /Users/jonseymour/.influxdb/data/_internal/monitor/1/000000070-000000001.tsm.tmp: no such file or directory
[tsm1] 2016/02/22 18:00:27 error writing snapshot: remove /Users/jonseymour/.influxdb/data/_internal/monitor/1/000000070-000000001.tsm.tmp: no such file or directory
[tsm1] 2016/02/22 18:02:57 error writing snapshot from compactor: remove /Users/jonseymour/.influxdb/data/_internal/monitor/1/000000071-000000001.tsm.tmp: no such file or directory
[tsm1] 2016/02/22 18:02:57 error writing snapshot: remove /Users/jonseymour/.influxdb/data/_internal/monitor/1/000000071-000000001.tsm.tmp: no such file or directory
[tsm1] 2016/02/22 18:05:27 beginning level 1 compaction of group 0, 2 TSM files
[tsm1] 2016/02/22 18:05:27 compacting level 1 group (0) /Users/jonseymour/.influxdb/data/_internal/monitor/1/000000072-000000001.tsm (#0)
[tsm1] 2016/02/22 18:05:27 compacting level 1 group (0) /Users/jonseymour/.influxdb/data/_internal/monitor/1/000000073-000000001.tsm (#1)
[tsm1] 2016/02/22 18:05:27 compacted level 1 group (0) into /Users/jonseymour/.influxdb/data/_internal/monitor/1/000000073-000000002.tsm.tmp (#0)
[tsm1] 2016/02/22 18:05:27 compacted level 1 group 0 of 2 files into 1 files in 1.629107ms

@jonseymour
Contributor Author

@jwilder - an example of (manual) integration test output which shows that this fix does what it intends to do.

To run the test, do:

   cd tsdb/engine/tsm1

Then run:

   GORACE=history=1000 go test -race cache_race_test.go

The current implementation of Cache.Deduplicate is not thread safe, as
revealed by a code review and, separately, by a Go race detector test case.

The first race condition arises because entries being deduplicated by the
Cache.Deduplicate method were being read by the cache logic without the
deduplicating thread taking a write lock.

A second race condition arose with one of the initial fixes: the snapshot
map was being updated by the deduplicating thread without readers of the
snapshot taking a read lock on it.

This fix works by taking a copy of the snapshot's store (called dirty) while
the write lock of the engine's cache is held. Only the entries that need sorting are
cloned. The other entries are copied by reference.

Deduplication occurs on the dirty slice rather than the store slice.
There is no thread-safety issue with the dirty slice because only the
snapshotting thread sees that slice. The entries that are updated by the
snapshotting thread are the entries that need sorting, but all of these
entries are clones which are not visible to any other thread.

When we are done deduplicating, we update the store of the snapshot while holding the
write lock of the engine's cache - not that of the snapshot itself.

Once a snapshot is deduplicated, it is never modified again. In the time
between the snapshot being created and the deduplication finishing, the
snapshot may be modified; however, in these cases the writes are
guarded by a write lock on the engine's cache and the snapshotting thread
does not refer to those values.

The end result is that it is no longer necessary to take any snapshot locks;
the lock guarding the engine's cache is sufficient.

The snapshotter does need to take the write lock on the engine's cache, but only at two points:

* while taking the initial snapshot
* after deduplication is complete when updating the cache's store reference

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
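
A rough sketch of the copy-then-swap pattern described in this commit message (the types and methods here - entry, needsSort, clone, deduplicate - are illustrative assumptions, not the actual implementation):

    package sketch

    import (
        "sort"
        "sync"
    )

    // Illustrative stand-ins for the real cache types; names are assumptions.
    type entry struct {
        values []float64
        sorted bool
    }

    func (e *entry) needsSort() bool { return !e.sorted }

    func (e *entry) clone() *entry {
        return &entry{values: append([]float64(nil), e.values...), sorted: e.sorted}
    }

    // deduplicate stands in for the real sort-and-remove-duplicates step.
    func (e *entry) deduplicate() {
        sort.Float64s(e.values)
        e.sorted = true
    }

    type Snapshot struct{ store map[string]*entry }

    type cache struct {
        mu sync.RWMutex // the engine cache's lock
    }

    // deduplicateSnapshot applies the pattern described in the commit message:
    // copy under the cache's write lock, sort the private copy with no lock
    // held, then swap the result back in under the cache's write lock.
    func (c *cache) deduplicateSnapshot(snap *Snapshot) {
        // 1. Take a "dirty" copy of the snapshot's store. Entries that need
        //    sorting are cloned; the rest are shared by reference.
        c.mu.Lock()
        dirty := make(map[string]*entry, len(snap.store))
        for k, e := range snap.store {
            if e.needsSort() {
                dirty[k] = e.clone()
            } else {
                dirty[k] = e
            }
        }
        c.mu.Unlock()

        // 2. Deduplicate the private copy. Only the snapshotting goroutine can
        //    see the clones, so no locks are needed here.
        for _, e := range dirty {
            if e.needsSort() {
                e.deduplicate()
            }
        }

        // 3. Publish the deduplicated store under the cache's write lock, not
        //    a lock on the snapshot itself.
        c.mu.Lock()
        snap.store = dirty
        c.mu.Unlock()
    }
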
@jonseymour
Contributor Author

Rebased to remove the "comment only assertions" commit that was removed from another PR.

@jwilder
Contributor

jwilder commented Feb 23, 2016

@jonseymour I think supporting multiple snapshots in the cache is not necessary and is complicating things. I have an alternative fix in #5789 that removes the multiple snapshot support.

@jonseymour
Contributor Author

Rebased onto a merge of #5701 and 7f457b8; this has fixes equivalent to the tip of #5795. It has one extra commit (7a03df2) from #5777 that is not in #5795.

Also, introduce a currently empty RollbackSnapshot method.

Later, we are going to use a commit pattern where resources are first prepared
and then either committed or rolled back.

Currently, the code doesn't understand the concept of a rollback, but
we will add it later.

For now, we will simply rename the existing methods to their final names and
add the RollbackSnapshot() method as a placeholder.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
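
A rough sketch of the intended Prepare/Commit-or-Rollback caller pattern (the method names follow this PR's naming, but the signatures and bodies are assumptions made for the sketch):

    package sketch

    // Cache is an illustrative stand-in for the real cache type.
    type Cache struct{}

    func (c *Cache) PrepareSnapshots() {} // acquire the snapshot and any associated locks
    func (c *Cache) CommitSnapshots()  {} // success path: release everything
    func (c *Cache) RollbackSnapshot() {} // failure path: release the lock, keep the snapshot

    // writeSnapshot shows the intended caller-side shape: exactly one of
    // Commit or Rollback runs on every path, so prepared resources are
    // always released.
    func writeSnapshot(c *Cache, write func() error) error {
        c.PrepareSnapshots()
        if err := write(); err != nil {
            c.RollbackSnapshot()
            return err
        }
        c.CommitSnapshots()
        return nil
    }
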
…led.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
This lock ensures that at most one thread is actively attempting to
commit the WAL log for a single shard at a time.

The assumption here is that it is not helpful to have more than
one thread attempting to do this, and that reasoning about correctness
is much more easily achieved if one only has to consider
a single thread.

The Prepare/Commit or Rollback pattern ensures that the lock will always
be released.

This is not to say that the TSM writing process might not benefit
from concurrency, only that the concurrency should be managed within
the scope of the shard-wide commit lock.

Note that the acquisition order is commit lock first, then cache mutex.
This ensures the cache is available for writes while the snapshot is
being processed.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
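
A rough sketch of the locking order described above (field names and method bodies are assumptions): the commit lock is held from Prepare until Commit or Rollback, while the cache mutex is taken only for the short sections that touch shared state, so writes into the cache can proceed while the snapshot is written to disk.

    package sketch

    import "sync"

    type cache struct {
        commit sync.Mutex   // shard-wide commit lock: held from Prepare to Commit/Rollback
        mu     sync.RWMutex // cache mutex: guards the store and the snapshot slice
    }

    func (c *cache) PrepareSnapshots() {
        // Acquisition order: commit lock first, then the cache mutex.
        c.commit.Lock()

        c.mu.Lock()
        // ... swap the current store into a new snapshot ...
        c.mu.Unlock()
        // The cache mutex is released here, so concurrent writes into the
        // cache are not blocked while the snapshot is written to disk.
    }

    func (c *cache) CommitSnapshots() {
        c.mu.Lock()
        // ... remove the written snapshots from the snapshot slice ...
        c.mu.Unlock()

        c.commit.Unlock()
    }

    func (c *cache) RollbackSnapshot() {
        // The snapshot stays in place so a later compaction can retry it;
        // only the commit lock is released.
        c.commit.Unlock()
    }
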
Now that only one thread is allowed to be actively writing
a snapshot at a time, we can simplify the commit logic to simply
remove all active snapshots. It also means we do not need
to scan for the committed snapshot, since we are going to purge all of them.

Note that Rollback doesn't touch the snapshot slice - it leaves the
snapshot in place until a subsequent compaction succeeds, at which
point it will be removed.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
The previous commits ensured that snapshots that failed to write
would eventually be released. However, only the current snapshot
would get written to disk. This commit ensures that the writes for
previously failed snapshots are retried.

To ensure we clean up only those WAL segments that we have successfully
compacted, we need to keep track of precisely which WAL segments each
snapshot captures, and retry all the snapshots in the same order.
We only remove WAL segments for snapshots that we successfully write,
and we stop on the first error we encounter.

When we rollback, we only rollback the snapshots that we failed to
write.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
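
A rough sketch of the retry behaviour described above (the names snapshot, segments, write, removeSegments, commit and rollback are illustrative assumptions, not the actual API):

    package sketch

    // snapshot is an illustrative stand-in: each prepared snapshot remembers
    // exactly which WAL segment files it captured.
    type snapshot struct {
        segments []string
    }

    // writeSnapshots retries previously failed snapshots together with the
    // current one, in order. WAL segments are only removed for snapshots that
    // were successfully written; we stop on the first error and roll back only
    // the snapshots that were not written, keeping any partial progress.
    func writeSnapshots(snaps []*snapshot, write func(*snapshot) error, removeSegments func([]string)) error {
        for i, s := range snaps {
            if err := write(s); err != nil {
                rollback(snaps[i:]) // only the snapshots that failed to be written
                return err
            }
            removeSegments(s.segments) // safe: this snapshot's data is now in a TSM file
            commit(s)
        }
        return nil
    }

    func commit(s *snapshot)      { /* drop s from the cache's snapshot slice */ }
    func rollback(ss []*snapshot) { /* leave these snapshots in place for the next retry */ }
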
Add some tests to check that new files specified on PrepareSnapshots are
associated with the latest snapshot.

The test now demonstrates that if a snapshot is rolled back, only the failed
snapshots will appear in the slice returned by the next
PrepareSnapshots call.

The test demonstrates how the discovery of new WAL segments leads to the
new files being associated with the latest snapshot.

The test demonstrates that when retrying multiple snapshots, we
can successfully commit partial progress, even if we need to rollback
one of the later snapshots.

A test to check that the commit lock is working as expected. In particular,
we want to check that if there is a pending commit, a prepare issued
by a concurrent goroutine will block until the
current holder of the lock either commits or rolls back.

This test also checks that concurrent writes into the cache
are not prevented by goroutines that hold the cache's commit lock.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
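
A rough sketch of a test along these lines, written against a minimal stand-in for the commit lock (the type and method names are assumptions, not the real influxdb API):

    package sketch

    import (
        "sync"
        "testing"
        "time"
    )

    // commitCache is a minimal stand-in for the commit-lock behaviour under test.
    type commitCache struct {
        commit sync.Mutex // held from Prepare until Commit or Rollback
    }

    func (c *commitCache) PrepareSnapshots() { c.commit.Lock() }
    func (c *commitCache) CommitSnapshots()  { c.commit.Unlock() }

    func TestPrepareBlocksUntilCommit(t *testing.T) {
        c := &commitCache{}
        c.PrepareSnapshots() // this goroutine now holds the commit lock

        done := make(chan struct{})
        go func() {
            c.PrepareSnapshots() // must block until the holder commits or rolls back
            c.CommitSnapshots()
            close(done)
        }()

        select {
        case <-done:
            t.Fatal("concurrent PrepareSnapshots did not block while the commit lock was held")
        case <-time.After(50 * time.Millisecond):
            // still blocked, as expected
        }

        c.CommitSnapshots() // release the commit lock

        select {
        case <-done:
            // the blocked prepare proceeded once the lock was released
        case <-time.After(time.Second):
            t.Fatal("PrepareSnapshots did not unblock after the commit")
        }
    }
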
@jwilder
Contributor

jwilder commented Feb 23, 2016

@jonseymour I'm closing this in favor of #5789 because it resolves #5686 by removing some unnecessary complexity in the cache. Since we don't need the multiple snapshots, it does not make sense to add additional layers on top of them that add further complexity to the cache.
